
    Topics in combinatorial pattern matching


    Sublinear Space Algorithms for the Longest Common Substring Problem

    Given $m$ documents of total length $n$, we consider the problem of finding a longest string common to at least $d \geq 2$ of the documents. This problem is known as the \emph{longest common substring (LCS) problem} and has a classic $O(n)$ space and $O(n)$ time solution (Weiner [FOCS'73], Hui [CPM'92]). However, the use of linear space is impractical in many applications. In this paper we show that for any trade-off parameter $1 \leq \tau \leq n$, the LCS problem can be solved in $O(\tau)$ space and $O(n^2/\tau)$ time, thus providing the first smooth deterministic time-space trade-off from constant to linear space. The result uses a new and very simple algorithm, which computes a $\tau$-additive approximation to the LCS in $O(n^2/\tau)$ time and $O(1)$ space. We also show a time-space trade-off lower bound for deterministic branching programs, which implies that any deterministic RAM algorithm solving the LCS problem on documents from a sufficiently large alphabet in $O(\tau)$ space must use $\Omega(n\sqrt{\log(n/(\tau\log n))/\log\log(n/(\tau\log n))})$ time.
    Comment: Accepted to the 22nd European Symposium on Algorithms.
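
    For intuition, the constant-space end of this trade-off can be illustrated with a very small baseline: an exact longest common substring computation between two documents in $O(n^2)$ time and $O(1)$ extra space, obtained by scanning every alignment of the two strings and tracking the longest run of agreeing characters. This is only a sketch of the general setting, not the paper's $\tau$-additive approximation algorithm; the function name is ours.

```python
def lcs_constant_space(s: str, t: str) -> int:
    """Length of the longest common substring of s and t.

    Exact, O(|s| * |t|) time, O(1) extra space: for every alignment
    (shift) of t against s, scan the overlap and track the longest run
    of positions where the two strings agree.
    """
    best = 0
    # Shift d aligns s[i] with t[i - d].
    for d in range(-(len(t) - 1), len(s)):
        run = 0
        for i in range(max(0, d), min(len(s), d + len(t))):
            if s[i] == t[i - d]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


if __name__ == "__main__":
    print(lcs_constant_space("banana", "ananas"))  # 5 ("anana")
```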

    Time-space trade-offs for Lempel-Ziv compressed indexing

    Given a string $S$, the \emph{compressed indexing problem} is to preprocess $S$ into a compressed representation that supports fast \emph{substring queries}. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel--Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + occ \lg\lg n)$ time using $O(z\lg(n/z)\lg\lg z)$ space, or (ii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z\lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + occ(\lg\lg n + \lg^{\epsilon} z))$ time using $O(z(\lg(n/z) + \lg^{\epsilon} z))$ space, where $n$ and $m$ are the lengths of the input string and the query string respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $occ$ is the number of occurrences of the query in the input, and $\epsilon > 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m\lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg\lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1+\frac{\lg^{\epsilon} z}{\lg(n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$ for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
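
    To make the parameter $z$ concrete, here is a naive sketch of the LZ77 factorization that such an index is built over. It only illustrates the definition of a phrase; the quadratic scan and function name are ours, not part of the paper's construction.

```python
def lz77_parse(s: str):
    """Naive LZ77 factorization of s (self-referential variant).

    Each phrase is either a single literal character or a pair
    (pos, length) referring to an earlier occurrence. The number of
    phrases is the parameter z in the space bounds above. Quadratic
    time; meant only to illustrate the definition.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # candidate earlier starting position
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            phrases.append(s[i])          # new character: literal phrase
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


if __name__ == "__main__":
    print(lz77_parse("abababbbb"))  # ['a', 'b', (0, 4), (5, 3)], so z = 4
```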

    Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

    Given a static reference string $R$ and a source string $S$, a relative compression of $S$ with respect to $R$ is an encoding of $S$ as a sequence of references to substrings of $R$. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string $S$ is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
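
    As a small illustration of the compression model itself (not of the dynamic data structures in the paper), a greedy parse encodes $S$ left to right, each time emitting a reference to the longest substring of $R$ that matches the remainder. The helper names below are ours, and the quadratic search is purely for clarity.

```python
def relative_compress(R: str, S: str):
    """Greedy relative compression of S with respect to a reference R.

    Encodes S as a list of (pos, length) references to substrings of R.
    Assumes every character of S occurs in R. Naive quadratic search;
    for illustration of the model only.
    """
    refs = []
    i = 0
    while i < len(S):
        best_len, best_pos = 0, -1
        for j in range(len(R)):
            k = 0
            while j + k < len(R) and i + k < len(S) and R[j + k] == S[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            raise ValueError(f"character {S[i]!r} does not occur in R")
        refs.append((best_pos, best_len))
        i += best_len
    return refs


def decompress(R: str, refs) -> str:
    """Recover S from its relative compression against R."""
    return "".join(R[p:p + l] for p, l in refs)


if __name__ == "__main__":
    R, S = "ACGTACGGT", "ACGGTACGT"
    refs = relative_compress(R, S)
    assert decompress(R, refs) == S
    print(refs)  # [(4, 5), (0, 4)]
```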

    Sparse Text Indexing in Small Space

    In this work we present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays and sparse position heaps for $b$ arbitrary positions of a text $T$ of length $n$, while using only $O(b)$ words of space during the construction. Attempts at breaking the naive bound of $\Omega(nb)$ time for constructing sparse suffix trees in $O(b)$ space can be traced back to the origins of string indexing in 1968. The first results were obtained only in 1996, and only for the case where the $b$ suffixes are evenly spaced in $T$. In this paper there is no constraint on the locations of the suffixes. Our main contribution is to show that the sparse suffix tree (and array) can be constructed in $O(n \log^2 b)$ time. To achieve this we develop a technique that allows us to efficiently answer $b$ longest common prefix queries on suffixes of $T$, using only $O(b)$ space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Our first solution is Monte Carlo and outputs the correct tree with high probability. We then give a Las Vegas algorithm which also uses $O(b)$ space and runs in the same time bounds with high probability when $b = O(\sqrt{n})$. Furthermore, additional trade-offs between the space usage and the construction time for the Monte Carlo algorithm are given. Finally, we show that at the expense of slower pattern queries, it is possible to construct sparse position heaps in $O(n + b \log b)$ time and $O(b)$ space.
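
    For concreteness, the sparse suffix array for $b$ chosen positions is simply those positions sorted by the lexicographic order of their suffixes. The trivial construction below illustrates the object only; it compares whole suffixes and therefore comes nowhere near the $O(n \log^2 b)$ time and $O(b)$ space bounds above.

```python
def sparse_suffix_array(text: str, positions):
    """Sparse suffix array for chosen positions of text.

    Returns the b positions sorted by lexicographic order of the
    suffixes starting there. This trivial version slices out whole
    suffixes, so it only shows what is being computed, not how to
    compute it in O(b) space.
    """
    return sorted(positions, key=lambda i: text[i:])


if __name__ == "__main__":
    print(sparse_suffix_array("mississippi", [0, 3, 5, 8]))  # [0, 8, 3, 5]
```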

    Fingerprints in compressed strings

    The Karp-Rabin fingerprint of a string is a type of hash value that, due to its strong properties, has been used in many string algorithms. In this paper we show how to construct a data structure for a string $S$ of size $N$, compressed by a context-free grammar of size $n$, that answers fingerprint queries. That is, given indices $i$ and $j$, the answer to a query is the fingerprint of the substring $S[i, j]$. We present the first $O(n)$ space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get $O(\log N)$ query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get $O(\log\log N)$ query time. Hence, our data structures have the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time $O(\log N \log \ell)$ and $O(\log \ell \log\log \ell + \log\log N)$ for SLPs and Linear SLPs, respectively. Here, $\ell$ denotes the length of the LCE.
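
    As a reminder of what a fingerprint query returns, here is a standard polynomial Karp-Rabin fingerprint over a prime field and the prefix-composition identity that recovers the fingerprint of $S[i, j]$ from prefix fingerprints. The concrete prime and base are illustrative choices, not taken from the paper, and the sketch works on an uncompressed string; the point of the paper is to support the same query directly on the grammar-compressed representation.

```python
# Polynomial Karp-Rabin fingerprints: phi(s) = sum_k s[k] * X^k mod P.
# P and X below are illustrative choices (P is a Mersenne prime).
P = (1 << 61) - 1
X = 1_000_003


def prefix_fingerprints(s: str):
    """Return F with F[i] = phi(s[:i]), plus the powers of the base."""
    F = [0] * (len(s) + 1)
    pw = [1] * (len(s) + 1)
    for i, c in enumerate(s):
        F[i + 1] = (F[i] + ord(c) * pw[i]) % P
        pw[i + 1] = pw[i] * X % P
    return F, pw


def substring_fingerprint(F, pw, i, j):
    """Fingerprint of s[i:j], normalized to start at exponent 0.

    Uses phi(s[:j]) = phi(s[:i]) + X^i * phi(s[i:j]), hence
    phi(s[i:j]) = (phi(s[:j]) - phi(s[:i])) * X^{-i} mod P.
    """
    inv = pow(pw[i], P - 2, P)  # modular inverse via Fermat's little theorem
    return (F[j] - F[i]) * inv % P


if __name__ == "__main__":
    s = "abracadabra"
    F, pw = prefix_fingerprints(s)
    # Equal substrings get equal fingerprints; unequal ones collide
    # only with small probability over the choice of base.
    assert substring_fingerprint(F, pw, 0, 4) == substring_fingerprint(F, pw, 7, 11)
    print(substring_fingerprint(F, pw, 0, 4))
```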

    String Matching with Variable Length Gaps

    We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n\log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern.
    Comment: draft of full version, extended abstract at SPIRE 201
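
    The matching semantics can be conveyed by a simple baseline that compiles the gapped pattern into a regular expression with bounded repetitions. This is only an illustration of the problem statement; it does not attain the $O(n\log k + m + \alpha)$ bound of the paper, and unlike the paper's algorithm a regex reports at most one ending position per starting position.

```python
import re


def gap_pattern_to_regex(parts, gaps):
    """Build a regex for a pattern P = s_1 g_1 s_2 g_2 ... s_k.

    parts is the list of strings [s_1, ..., s_k]; gaps is the list of
    (lo, hi) length ranges for the k-1 variable length gaps. A
    lookahead lets overlapping matches be reported, but only one
    ending position per starting position is found -- a baseline, not
    the paper's algorithm.
    """
    assert len(gaps) == len(parts) - 1
    body = re.escape(parts[0])
    for (lo, hi), part in zip(gaps, parts[1:]):
        body += ".{%d,%d}" % (lo, hi) + re.escape(part)
    return re.compile("(?=(" + body + "))")


if __name__ == "__main__":
    T = "xATAGGACCAy"
    rx = gap_pattern_to_regex(["AT", "CC"], [(2, 5)])
    ends = [m.end(1) for m in rx.finditer(T)]  # ending positions of matches
    print(ends)  # [9]
```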

    The Hardness of the Functional Orientation 2-Color Problem

    We consider the Functional Orientation 2-Color problem, which was introduced by Valiant in his seminal paper on holographic algorithms [SIAM J. Comput., 37(5), 2008]. For this decision problem, Valiant gave a polynomial-time holographic algorithm for planar graphs of maximum degree 3, and showed that the problem is NP-complete for planar graphs of maximum degree 10. A recent result on defective graph coloring by Corrêa et al. [Australas. J. Combin., 43, 2009] implies that the problem is already hard for planar graphs of maximum degree 8. Together, these results leave open the hardness question for graphs of maximum degree between 4 and 7. We close this gap by showing that the answer is always yes for arbitrary graphs of maximum degree 5, and that the problem is NP-complete for planar graphs of maximum degree 6. Moreover, for graphs of maximum degree 5, we note that a linear-time algorithm for finding a solution exists.